Abstract

We uniquely characterize two members of the Q-Clustering family in an axiomatic framework.
We introduce properties that use known tree constructions for the purpose of
characterization. To characterize the Max-Sum clustering algorithm, we use the Gomory-Hu
construction, and to characterize Single-Linkage, we use the Maximum Spanning Tree.
Although at first glance it seems these properties are 'obviously' all that are necessary to
characterize Max-Sum and Single-Linkage, we show that this is not the case, by
investigating how subsets of properties interact. We conclude by proposing additions to the
taxonomy of clustering paradigms currently in use.

Keywords: Clustering properties, axioms, submodularity, Q-Clustering
1. Introduction 

Clustering is a ubiquitous task in unsupervised learning, finding application in a large
array of fields other than Computer Science. Any task where one seeks to group together
similar objects and separate dissimilar objects can be thought of as 'Clustering'. Although
the problem as defined leaves open a great deal of interpretation (Blum, 2009; Guyon et al.,
2009), in this paper we concretely focus on a large class of objectives which can be optimized
using Queyranne's algorithm for submodular optimization (Queyranne, 1998; Rizzi, 2000).

This class of objectives for Q-Clustering was introduced and used in Narasimhan et al.
(2006), which we analyze from an axiomatic perspective. Recently there has been a
significant amount of work on clustering axioms and properties (Ackerman and Ben-David,
2008; Zadeh and Ben-David, 2009; Carlsson and Memoli, 2010b), and we build in this
direction to provide properties for the class of algorithms which can be expressed as optima
of Queyranne's optimization algorithm.

Narasimhan et al. (2006) considered two objectives, to which we add and analyze a new
objective. The two objectives in the original Q-Clustering paper were Single-Linkage and
Minimum Description Length (MDL). To these we add the Max-Sum objective, which seeks
to maximize the sum of similarities inside clusters. Queyranne's algorithm can optimize
the Single-Linkage criterion perfectly for k clusters, but its perfect recovery is limited to
only 2 clusters for Max-Sum and MDL. This is not surprising, since optimizing Max-Sum
and MDL perfectly for k > 2 is NP-hard. However, there exist factor 2 approximation
algorithms which proceed by cutting the k - 1 lightest edges of the Gomory-Hu tree
(Gomory and Hu, 1961) associated with MDL and Max-Sum. Gomory-Hu trees exist in
general for any symmetric submodular function. Furthermore, the natural tree construction
for Single-Linkage (SL) is the maximum spanning tree (MST), since the SL criterion can
be optimized by cutting the k - 1 lightest edges of the MST.

We focus on the underlying principles behind the Q-Clustering family of objectives, and
arrive at the conclusion that they are uniquely characterized by cutting tree edges from
well-known tree constructions. Furthermore, we prove that they are the unique clustering
functions using the mentioned tree constructions. These constructions are defined formally
in Section 1.2. We show that Max-Sum and Single-Linkage are axiomatically identical,
except for which of the above trees they use in the course of their operation. Furthermore,
we show that they are the only clustering functions determined by those trees.

The three objectives we consider are quite different. The Single-Linkage objective has 
the advantage that since we are only comparing similarities, the similarity measure can be 
an ordered set (arithmetic operations on the similarities need not be defined). We formalize 
the property that allows this criterion to only depend on the rank ordering of the similarities, 
so it is insensitive to monotone transformations of the similarities. This flexibility is useful 
in applications where only rankings are available, for example user studies in which humans 
only provide rankings instead of actual similarity scores. Unfortunately Single-Linkage is 
also very sensitive to outliers since all it takes to merge two clusters is a single path between 
the two clusters where all similarities are above a threshold. Since there are generally many 
paths between two clusters, this is a stringent requirement on the designers of the similarity 
matrix. The natural tree construction associated with the Single-Linkage criterion is the
Maximum Spanning Tree (MST), defined formally in Section 1.2.

The second objective we consider is the Max-Sum objective, which doesn't suffer from
the single-path problem of Single-Linkage, but has the tendency to create a single large cluster,
since having more edges in the objective summation is usually better than fewer. The tree
construction we use for this is the Gomory-Hu tree, explained formally in Section 1.2.

The third objective we consider is probabilistic in nature and is based on the Minimum
Description Length principle. We are given a distribution for each data item, and we
attempt to find clusters so that describing or encoding the clusters (separately) can be
done using as few bits as possible. It is known that the problem of finding the optimal
clusterings minimizing the description length is equivalent to the problem of minimizing a
symmetric submodular function (Narasimhan et al., 2006). The tree construction associated
with this objective is the cut-equivalent tree associated with the symmetric submodular
MDL function. We do not provide a uniqueness theorem for MDL since it is not clear
how to define some of our axioms in this setting.

It is important to note that we draw a sharp distinction between axioms and properties.
Although they both restrict the class of partitioning functions, we expect axioms to appeal
to our intuition about clustering, whereas properties are simply restrictions on the class of
partitioning functions; they need not appeal to any intuition about clustering.

1.1 Previous Work 



In addition to the submodular optimization algorithm by Queyranne (1998), generalized by
Rizzi (2000), our formal framework is based on Kleinberg's (Kleinberg, 2003). We also adopt
two of the three axioms proposed in that paper, Consistency and Scale Invariance. We replace
the Richness axiom of Kleinberg (2003) by its version for the case of a fixed number of
clusters, k-Richness.

An axiomatic characterization of Single-Linkage is available in Zadeh and Ben-David
(2009), but they do not provide any characterization of Max-Sum or submodular objectives.
Flake et al. (2004) present some algorithms which use Minimum Cut Trees for
the purpose of clustering, but they do not discuss axioms or uniqueness theorems. Another
line of attack for characterizing clustering methods is using tools from Topology,
explored in Carlsson and Memoli (2010a) and Carlsson and Memoli (2008). Instead of fixing
k, there are other ways of circumventing Kleinberg's impossibility result, analyzed in
Ackerman and Ben-David (2009), and a characterization of the class of hierarchical clustering
functions is given in Ackerman et al. (2010). Submodular objectives for clustering are
explored in Jegelka and Bilmes (2010) and Narasimhan et al. (2006).

Finally, Balcan et al. (2008) and Awasthi and Zadeh (2010) present a framework which
assumes that, given some data, a target clustering is achieved by interacting with a teacher.
Interacting with a teacher is a departure from unsupervised learning.



1.2 Formal Preliminaries 

A partitioning function acts on a set S of n > 2 points, and pairwise similarities among the
points in S. The points in S are not assumed to belong to any specific set; the pairwise
similarities are the only data the partitioning function has about them. Since we wish to
deal with point sets that do not necessarily belong to a specific set, we identify the points
with the set S = {1, 2, ..., n}. We can then define a similarity function to be any function
$s : S \times S \to \mathbb{R}^+$ such that for distinct $i, j \in S$, we have $s(i,j) > 0$, i.e. s must be
positive and symmetric, but there is no requirement of triangle inequality.

Sometimes we write $s = (e_1, e_2, \ldots, e_{\binom{n}{2}})$ to mean the set of edges that exist between all
pairs of n points. This list is always ordered by decreasing similarity. w(e) is the weight
of edge e which connects some two points i, j. So w(e) = s(i, j).

A partitioning function is a function F that takes a similarity function s on $S \times S$ and
returns a k-partitioning of S. A k-partitioning of S is a collection of k non-empty disjoint
subsets of S whose union is S. The sets in F(s) will be called its clusters. Two partitioning
functions are equivalent if and only if they output the same partitioning on all values of s,
i.e. they are functionally equivalent. In the next section we define several trees, each
associated with a particular graph G.

A natural representation for a similarity function $s = (e_1, e_2, \ldots, e_{\binom{n}{2}})$ is a complete
weighted graph $G_s$, whose n nodes correspond to our objects, and edges correspond to
similarity scores assigned by s. Note that when there is no ambiguity about which similarity
function is being used, we drop the subscript s and simply use G. Also, when it is natural
to think of the complete graph for the similarity function, we simply use s to refer to the
graph, instead of $G_s$. Now some trees will be associated with a particular s.
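As a concrete illustration of the edge-list view of s, the following minimal sketch builds the ordered list of edges from a symmetric similarity matrix. The function name `edge_list` and the 0-based point labels are assumptions made for this example only; the text itself labels points 1, ..., n.

```python
def edge_list(sim):
    """Return [(i, j, weight)] over all unordered pairs, by decreasing similarity.

    `sim` is assumed to be a symmetric n x n matrix (list of lists) with
    positive off-diagonal entries; points are labelled 0..n-1 here.
    """
    n = len(sim)
    edges = [(i, j, sim[i][j]) for i in range(n) for j in range(i + 1, n)]
    # sort() is stable, so tied edges keep their lexicographic pair order.
    edges.sort(key=lambda e: -e[2])
    return edges


sim = [
    [0, 5, 1],
    [5, 0, 3],
    [1, 3, 0],
]
print(edge_list(sim))
```

The list has n(n - 1)/2 entries, one per edge of the complete graph on the points.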



1.2.1 Minimum Cut Tree 

Let G = (V, E, s) be any arbitrary weighted undirected connected graph with |V| = n. Two
disjoint subsets A and B of V that also cover V define a cut in G. The sum of the weights
of the edges crossing the cut defines the cut value. For two nodes s, t, a minimum s-t cut is
a cut of minimum value that separates nodes s and t. For an integer k, a minimum k-cut
is a cut of minimum value that leaves exactly k connected components.

The Minimum Cut Tree of G - call it MCT(G) - is a tree which has the same nodes 
as G, but a different set of edges and weights. The edges of the Minimum Cut Tree of G 
satisfy the following two properties. 

• For every two nodes $s, t \in V$, the minimum cut in G that separates these points is
given by cutting the smallest edge on the unique path between s and t in MCT(G).

• For every two nodes $s, t \in V$, the weight of the smallest edge on the unique path (in
MCT(G)) connecting them equals the size of a minimum s-t cut of G.

Note that MCT(G) is not a subgraph of G as it has different edges and weight function.
One may ask if such trees always exist, and indeed for every undirected graph, there always
exists a min-cut tree. They were initially introduced by Gomory and Hu (1961).

Minimum Cut Trees are not always unique, so we define a canonical MCT function
which fixes a particular ordering on pairs of points (the lexicographical ordering), and uses
the algorithm defined in (Gomory and Hu, 1961) and outlined momentarily. Under these
conditions, the output of the Gomory-Hu algorithm will be deterministic (thus unique) and
we will denote its value MCT(G).

Given an input graph G = (V, E), the Gomory-Hu algorithm maintains a partition of V,
$(S_1, S_2, \ldots, S_t)$, and a spanning tree T on the vertex set $\{S_1, \ldots, S_t\}$. Let w' be the function
assigning weights to the edges of T. Tree T satisfies the following invariant. Invariant:
For any edge $(S_i, S_j)$ in T there are vertices a and b in $S_i$ and $S_j$ respectively, such that
$w'(S_i, S_j) = f(a, b)$, where f denotes the minimum cut/maximum flow function, and the
cut defined by edge $(S_i, S_j)$ is a minimum a-b cut in G. This invariant is maintained for
n - 1 steps, after which T is a Gomory-Hu tree of G (Vazirani, 2001). Each step involves a
call to a standard s-t Min-Cut/Max-Flow algorithm.
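To make the role of the n - 1 minimum-cut calls concrete, here is a small sketch that builds a tree whose path minima equal the pairwise minimum cut values. It is not the contraction-based procedure described above: it swaps in Gusfield's simplification, which avoids contracting supernodes, and uses a plain Edmonds-Karp max-flow. All function names (`min_st_cut`, `gomory_hu_tree`, `path_min`, `brute_force_min_cut`) are hypothetical, and nodes are labelled 0..n-1.

```python
from collections import deque
from itertools import combinations


def min_st_cut(w, s, t):
    """Edmonds-Karp on a symmetric weight matrix; returns (cut value, s-side nodes)."""
    n = len(w)
    res = [row[:] for row in w]               # residual capacities
    flow = 0
    while True:
        parent = [-1] * n
        parent[s] = s
        queue = deque([s])
        while queue and parent[t] == -1:      # BFS for a shortest augmenting path
            u = queue.popleft()
            for v in range(n):
                if parent[v] == -1 and res[u][v] > 0:
                    parent[v] = u
                    queue.append(v)
        if parent[t] == -1:                   # t unreachable: visited nodes form the s side
            return flow, {v for v in range(n) if parent[v] != -1}
        bottleneck, v = float("inf"), t
        while v != s:
            bottleneck = min(bottleneck, res[parent[v]][v])
            v = parent[v]
        v = t
        while v != s:                         # push flow along the path
            res[parent[v]][v] -= bottleneck
            res[v][parent[v]] += bottleneck
            v = parent[v]
        flow += bottleneck


def gomory_hu_tree(w):
    """Gusfield's variant: n - 1 min-cut calls, returns tree edges (u, v, weight)."""
    n = len(w)
    parent = [0] * n
    tree = []
    for i in range(1, n):
        value, side = min_st_cut(w, i, parent[i])
        for j in range(i + 1, n):
            if j in side and parent[j] == parent[i]:
                parent[j] = i                 # j now hangs below i
        tree.append((i, parent[i], value))
    return tree


def path_min(tree, n, a, b):
    """Weight of the lightest edge on the unique tree path from a to b."""
    adj = {v: [] for v in range(n)}
    for u, v, c in tree:
        adj[u].append((v, c))
        adj[v].append((u, c))
    stack = [(a, -1, float("inf"))]
    while stack:
        u, prev, lightest = stack.pop()
        if u == b:
            return lightest
        stack.extend((v, u, min(lightest, c)) for v, c in adj[u] if v != prev)


def brute_force_min_cut(w, a, b):
    """Minimum a-b cut by enumerating every vertex subset (small n only)."""
    n = len(w)
    return min(
        sum(w[i][j] for i in side for j in range(n) if j not in side)
        for r in range(1, n)
        for side in combinations(range(n), r)
        if a in side and b not in side
    )


w = [[0, 3, 1, 2], [3, 0, 4, 1], [1, 4, 0, 5], [2, 1, 5, 0]]
tree = gomory_hu_tree(w)
print(tree)
```

On small inputs the tree can be checked against `brute_force_min_cut` for every pair of nodes, which is a direct test of the second bullet property of a Minimum Cut Tree. This sketch does not implement the lexicographic tie-breaking that the canonical MCT function requires.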

1.2.2 Submodular Functions

Fix a finite set S. Submodularity is a property enforced on functions mapping subsets
of S to the reals. Intuitively, a submodular function over the powerset demonstrates
"diminishing returns". From a practical perspective, there exist polynomial time algorithms
for minimizing submodular functions. In this sense they are the discrete analog of convex
functions (Lovász, 1983). Let S be a finite set. A function $f : 2^S \to \mathbb{R}$ is submodular iff for
all subsets A and B of S we have

$$f(A) + f(B) \geq f(A \cap B) + f(A \cup B)$$

Furthermore, a symmetric function is one for which $f(S \setminus A) = f(A)$. An important note
is that Gomory-Hu trees work and can be generalized because the cut function is both
submodular and symmetric. Any submodular symmetric function will induce a Gomory-Hu
tree (Schrijver, 2003). An example of a symmetric submodular function is mutual
information between two sets of random variables X and Y,

$$I(X;Y) = \int_Y \int_X p(x,y) \log \frac{p(x,y)}{p_1(x)\, p_2(y)} \, dx \, dy,$$

where $p(x,y)$ is the joint probability distribution function of X and Y, and $p_1(x)$ and $p_2(y)$
are the marginal probability distribution functions of X and Y respectively. For a deeper
discussion and proofs of submodularity, see Toshev (2010).
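The claim that the cut function is both symmetric and submodular can be checked exhaustively on a small graph. The sketch below is only a brute-force sanity check over all subsets, not a proof; the helper names (`subsets`, `cut`, `is_symmetric`, `is_submodular`) and the example matrix are assumptions of this illustration.

```python
from itertools import combinations


def subsets(n):
    """Yield every subset of {0, ..., n-1} as a frozenset."""
    for r in range(n + 1):
        for c in combinations(range(n), r):
            yield frozenset(c)


def cut(sim, a):
    """Cut function f(A): total similarity between A and its complement."""
    n = len(sim)
    return sum(sim[i][j] for i in a for j in range(n) if j not in a)


def is_symmetric(f, n):
    """Check f(S \\ A) == f(A) for every subset A."""
    ground = frozenset(range(n))
    return all(f(a) == f(ground - a) for a in subsets(n))


def is_submodular(f, n):
    """Check f(A) + f(B) >= f(A & B) + f(A | B) for every pair of subsets."""
    all_sets = list(subsets(n))
    return all(f(a) + f(b) >= f(a & b) + f(a | b) for a in all_sets for b in all_sets)


sim = [[0, 3, 1, 2], [3, 0, 4, 1], [1, 4, 0, 5], [2, 1, 5, 0]]
f = lambda a: cut(sim, a)
print(is_symmetric(f, 4), is_submodular(f, 4))
```

For contrast, the set function $|A|^2$ fails the inequality (take two distinct singletons), so the same checker rejects it.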



1.2.3 Maximum Spanning Tree 

Given a connected, undirected weighted graph G, a spanning tree of G is a subgraph which 
is a tree and connects all the vertices. A single graph can have many different spanning 
trees. The weight of a spanning tree is computed as the sum of the weights of the edges in 
the spanning tree. A Maximum Spanning Tree (MST) is then a spanning tree with weight 
greater than or equal to every other spanning tree. 

Similar to Minimum Cut Trees, MSTs also have a rich history. They can be computed 
efficiently by Kruskal's algorithm. Maximum Spanning Trees are not always unique, but in 
the case that we fix an edge ordering, it is well known that they are unique for a particular 
G. We denote the canonical Maximum Spanning Tree of a graph as MST(G). 



1.2.4 Single-Linkage 

Single-Linkage is the clustering function which starts with all points in singleton clusters,
and successively merges clusters until only k clusters are left. The similarity of two clusters
is the similarity of the two most similar points inside differing clusters.

Another way to compute the Single-Linkage k-partitioning is to cut the k - 1 smallest
edges of the Maximum Spanning Tree of $G_s$ (Gower and Ross, 1969). It should be noted
that the behavior of Single-Linkage is robust against small fluctuations in the weight of the
edges in s, so long as the order of edges does not change, which can be readily seen from
Kruskal's algorithm.
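The connection to Kruskal's algorithm can be sketched as follows. Running Kruskal on edges in decreasing order of similarity and stopping once k components remain adds exactly the n - k heaviest Maximum Spanning Tree edges, which is the same as building the full tree and cutting its k - 1 smallest edges. The function name `single_linkage`, the 0-based labels, and the example matrix are assumptions of this illustration.

```python
def single_linkage(sim, k):
    """Single-Linkage k-partitioning of points 0..n-1 from a symmetric matrix."""
    n = len(sim)
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Edges in decreasing order of similarity; ties keep lexicographic order.
    edges = sorted(
        ((sim[i][j], i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda e: -e[0],
    )
    components = n
    for _, i, j in edges:
        if components == k:
            break
        ri, rj = find(i), find(j)
        if ri != rj:                 # edge joins two clusters: it is an MST edge
            parent[ri] = rj
            components -= 1

    clusters = {}
    for v in range(n):
        clusters.setdefault(find(v), set()).add(v)
    return {frozenset(c) for c in clusters.values()}


sim = [[0, 9, 2, 1], [9, 0, 3, 1], [2, 3, 0, 8], [1, 1, 8, 0]]
print(single_linkage(sim, 2))
```

Because only the order of the edges is consulted, applying any strictly increasing transformation to the similarities (for instance cubing them) leaves the output unchanged.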



1.2.5 Max-Sum 

The objective of Max-Sum is to maximize $\Lambda_s(\Gamma) = \sum_{C \in \Gamma} \sum_{i,j \in C} s(i,j)$ over all k-partitionings
$\Gamma$. Finding the optimal partitioning is NP-hard for k > 2. However, for k = 2
finding the optimal Max-Sum 2-partitioning is the same as finding the global minimum cut
and thus poly-time computable. Finding the overall minimum cut is equivalent to cutting
the smallest edge of the Minimum Cut Tree, so for k = 2 Max-Sum can be reinterpreted as
the algorithm which cuts the smallest edge of the Minimum Cut Tree.

Since it is not computationally feasible to optimize the above objective function, we
define the following approximation algorithm, which has a guaranteed approximation factor
of 2 - 2/k (Vazirani, 2001): simply iteratively find and remove the global minimum cut until
exactly k connected components remain. This algorithm is called the "Max-Sum" clustering
function throughout.
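The equivalence for k = 2 follows because the within-cluster sum and the cut value of a 2-partitioning always add up to the total similarity, so maximizing one minimizes the other. The brute-force sketch below illustrates this on a small example; it enumerates all 2-partitionings, so it is meant for tiny inputs only, and the helper names are hypothetical.

```python
from itertools import combinations


def bipartitions(n):
    """Yield every split of {0..n-1} into two non-empty sets (A, B), with 0 in A."""
    for r in range(0, n - 1):
        for extra in combinations(range(1, n), r):
            a = {0, *extra}
            yield a, set(range(n)) - a


def inner_sum(sim, part):
    """Max-Sum objective: total similarity over pairs inside the same cluster."""
    return sum(sim[i][j] for c in part for i, j in combinations(sorted(c), 2))


def cut_value(sim, a, b):
    """Total similarity of the edges crossing between A and B."""
    return sum(sim[i][j] for i in a for j in b)


def best_max_sum(sim):
    return max(bipartitions(len(sim)), key=lambda p: inner_sum(sim, p))


def global_min_cut(sim):
    return min(bipartitions(len(sim)), key=lambda p: cut_value(sim, *p))


sim = [[0, 9, 2, 1], [9, 0, 3, 1], [2, 3, 0, 8], [1, 1, 8, 0]]
print(best_max_sum(sim), global_min_cut(sim))
```

Here the objective counts each unordered pair once; counting ordered pairs instead would double every value without changing the optimal partitioning.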



2. Uniqueness Results 
2.1 Axioms 

Now in an effort to distinguish clustering functions from partitioning functions, we review
some axioms (not properties) that one may like a clustering function to satisfy. Here is the
first one. If s is a similarity function, then define $\alpha \cdot s$ to be the same function with all
similarities multiplied by $\alpha$.

Scale-Invariance. For any similarity function s, 1 < k < n, and scalar $\alpha > 0$, we have

$$F(s, k) = F(\alpha \cdot s, k)$$

This axiom simply requires the function to be immune to stretching or shrinking the
data points linearly. It effectively disallows clustering functions to be sensitive to changes
in units of measurement, which is desirable. We would like clustering functions to not have
any predefined hard-coded similarity values in their decision process.

The next axiom ensures that the clustering function is "rich" and not crippled in the types
of partitioning it could output. For a fixed S, let $\mathrm{Range}(F(\cdot))$ be the set of all possible
outputs while varying s.

k-Richness. $\mathrm{Range}(F(\cdot, k))$ is equal to the set of all k-partitionings of S.

In other words, if we are given a set of points such that all we know about the points are
pairwise similarities, then for any partitioning $\Gamma$, there should exist an s such that
$F(s) = \Gamma$. By varying similarities amongst points, we should be able to obtain all possible
k-partitionings.

The next axiom is more subtle and was initially introduced in Kleinberg (2003), along
with richness. We call a partitioning function "consistent" if it satisfies the following:
when we increase similarities between points in the same cluster and decrease similarities
between points in different clusters, we get the same result. Formally, we say that
s' is a $\Gamma$-transformation of s if (a) for all $i, j \in S$ belonging to the same cluster of $\Gamma$, we
have $s'(i,j) > s(i,j)$; and (b) for all $i, j \in S$ belonging to different clusters of $\Gamma$, we have
$s'(i,j) < s(i,j)$. In other words, s' is a transformation of s such that points inside the same
cluster are made more similar and points not inside the same cluster are made less similar.

Consistency. Let s be a similarity function, and s' be an $F(s, k)$-transformation of s.
Then $F(s, k) = F(s', k)$.

In other words, suppose that we run the partitioning function F on s to get back a
particular partitioning $\Gamma$. Now, with respect to $\Gamma$, if we increase in-cluster similarities or
decrease between-cluster similarities and run F again, we should still get back the same
result, namely $\Gamma$.
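A transformation of this kind is easy to test mechanically. The sketch below checks whether one similarity matrix is a transformation of another with respect to a given partitioning. It assumes, as a labelled departure from the inequalities as printed, that a similarity may also stay unchanged (non-strict comparisons); the function names and example data are hypothetical.

```python
def same_cluster(partition, i, j):
    """True if points i and j lie in the same cluster of the partition."""
    return any(i in c and j in c for c in partition)


def is_gamma_transformation(s, s_new, partition):
    """Check that inner similarities did not decrease and outer ones did not increase.

    Assumption: unchanged similarities are allowed (>= and <= rather than > and <).
    """
    n = len(s)
    for i in range(n):
        for j in range(i + 1, n):
            if same_cluster(partition, i, j):
                if s_new[i][j] < s[i][j]:
                    return False
            elif s_new[i][j] > s[i][j]:
                return False
    return True


partition = [{0, 1}, {2, 3}]
s = [[0, 9, 2, 1], [9, 0, 3, 1], [2, 3, 0, 8], [1, 1, 8, 0]]
# Double the inner edges (0,1) and (2,3); halve every outer edge.
s_new = [[0, 18, 1, 0.5], [18, 0, 1.5, 0.5], [1, 1.5, 0, 16], [0.5, 0.5, 16, 0]]
print(is_gamma_transformation(s, s_new, partition))
```

A Consistent partitioning function that outputs `partition` on `s` must output the same partitioning on `s_new`. Swapping the roles of the two matrices gives a pair that fails the check.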

The difference between these and the axioms defined by Kleinberg (2003) is that at all
times, the partitioning function F is forced to return a fixed number of clusters. If this
were not the case, then the above axioms could never be satisfied by any function. In most
popular clustering algorithms such as k-means, Single-Linkage, and spectral clustering, the
number of clusters to be returned is determined beforehand, by the human user or other
methods, and passed into the clustering function as a parameter. Note that it is not clear
how to extend Consistency to the case where there is a distribution associated with every
point, since there is not an obvious definition of similarity between two points, which is
needed to define Consistency in terms of increasing or decreasing similarities.

Definition 1 A Clustering Function is a partitioning function that satisfies Consistency,
Scale Invariance, and k-Richness.

We've already introduced two clustering functions.

Theorem 2 Single-Linkage and Max-Sum are Clustering functions.

Both Single-Linkage and Max-Sum are clustering functions, with proofs available in
Zadeh and Ben-David (2009). So a natural question to ask is what types of properties (and
not axioms) distinguish these two clustering functions from one another? We introduce two
new properties that one may not expect all clustering functions to satisfy, but may at times
be desirable.



2.2 Properties 

For ease of notation, let MST(s) be the Maximum Spanning Tree of $G_s$. Similarly, let
MCT(s) be the Minimum Cut Tree of $G_s$. Now we are ready to define two distinguishing
properties.

MST-Consistency. If s and s' are similarity functions such that MST(s) and MST(s')
have the same minimum k-cut, then $F(s, k) = F(s', k)$.

In other words, a clustering function is MST-Consistent if it makes all its decisions based
on the Maximum Spanning Tree. Note that this does not mean the algorithm must optimize
any particular objective, just that its decisions are based on the MST. It is important to
note that this property includes both the weights on the edges and the structure of the
MST. Single-Linkage satisfies this property since it cuts the k - 1 smallest edges of the
MST. Note that this is a property, not an axiom: we don't expect all clustering functions
to make their decisions using the Maximum Spanning Tree. Later in this section we show
that Single-Linkage is the only clustering function that is MST-Consistent.

Similarly, define MCT-Consistency identically to MST-Consistency, with MST replaced
with MCT.

MCT-Consistency. If s and s' are similarity functions such that MCT(s) and MCT(s')
have the same minimum k-cut, then $F(s, k) = F(s', k)$.

This property forces a clustering function to make all its decisions based on the Minimum
Cut Tree. Max-Sum satisfies this property since it always cuts the smallest k - 1 edges of
the minimum cut tree. We show that Max-Sum is the only clustering function that is
MCT-Consistent.

Notice that both MST-Consistency and MCT-Consistency imply Scale-Invariance; mean- 
ing that if a function is either MST-Consistent or MCT-Consistent, then it is also Scale- 
Invariant. For this reason, whenever a function satisfies {MCT, MST}-Consistency, we 
ignore Scale-Invariance. 
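The implication can be spelled out briefly. The sketch below assumes that "the same minimum k-cut" means the two canonical trees induce the same partition of S once their k - 1 lightest edges are removed, with ties broken by the fixed edge ordering; both readings are assumptions of this illustration rather than statements from the definitions above.

```latex
% Scaling by \alpha > 0 preserves the order of all edge weights, so Kruskal's
% algorithm selects the same edges for s and for \alpha \cdot s.
\[
  \mathrm{MST}(\alpha \cdot s) = \alpha \cdot \mathrm{MST}(s)
  \;\Longrightarrow\;
  \text{both trees have the same $k-1$ lightest edges}
  \;\Longrightarrow\;
  F(s, k) = F(\alpha \cdot s, k),
\]
% where the last step is MST-Consistency. The same argument applies to MCT,
% because every cut value of G_{\alpha \cdot s} is \alpha times the
% corresponding cut value of G_s.
```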



2.3 Uniqueness Theorems 

Shortly, we will show that Single-Linkage and Max-Sum are uniquely characterized
by MST-Consistency and MCT-Consistency, respectively. Before doing this, we reflect on
the relationships between subsets of our axioms and properties. In doing so, we show that
the MST/MCT-Consistency properties alone are not enough to characterize SL and Max-Sum,
as one might think at first glance.

We will show that if any of the axioms or properties that we use for proving uniqueness
were missing, then the uniqueness results would not hold. This means that all axioms
and properties are really necessary to prove uniqueness of Single-Linkage and Max-Sum.

Theorem 3 Consistency, MCT-Consistency, and k-Richness are necessary to characterize
Max-Sum.

Proof For each of the mentioned properties, we show that all the other properties and
axioms together are not enough to uniquely characterize Max-Sum. To this end, for each
property, we exhibit an algorithm that acts differently than Max-Sum, and satisfies all the
properties except for that one. In other words, we show that without each of these properties,
the remaining ones do not uniquely characterize Max-Sum.

Consistency is necessary. We define the Minimum Cut Tree Cuts family of partitioning
functions. As usual, the task is to partition n points into k clusters. Let $\sigma$ be a permutation
function for the set of all k-partitionings of S. A particular member of the MCT-Cuts family
computes the Max-Sum k-partitioning, then runs $\sigma$ on the output of Max-Sum. The entire
family is obtained by varying the particular permutation used. Since Max-Sum is k-Rich,
and $\sigma$ is a bijection, all the members of MCT-Cuts are also k-Rich. Since Max-Sum is
MCT-Consistent and $\sigma$ does not look at the input, all the members of MCT-Cuts are also
MCT-Consistent. However, it is not true that all members of MCT-Cuts are Consistent. This
is implied by Theorem 6, which says the only Consistent member of MCT-Cuts is Max-Sum
itself, i.e. the case that $\sigma$ is the identity permutation.

MCT-Consistency is necessary. Consider that Single-Linkage satisfies Consistency,
Scale-Invariance, and k-Richness, but is obviously not the same function as Max-Sum. Thus
MCT-Consistency is necessary.

k-Richness is necessary. Now consider the Constant clustering function which always
returns the first n - k + 1 elements of S as a single cluster and returns the remaining k - 1
elements as singleton clusters (a singleton is a cluster with a single point in it), making a
total of k clusters. Because this function does not look at s, it is trivially MCT-Consistent
(and MST-Consistent), Consistent, and Scale-Invariant. However, it is not k-Rich because it
always returns some singletons, i.e. we could never reach a k-partitioning that has no
singletons. ■
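The Constant function from the last case is simple enough to write down. The sketch below uses 0-based labels and a hypothetical name, `constant_partition`; it accepts the similarity matrix only to make visible that the matrix is never read beyond its size.

```python
def constant_partition(sim, k):
    """Constant partitioning: first n - k + 1 points together, the rest as singletons.

    The similarity values in `sim` are deliberately ignored; only n = len(sim) is used.
    """
    n = len(sim)
    big = frozenset(range(n - k + 1))
    singletons = [frozenset({v}) for v in range(n - k + 1, n)]
    return {big, *singletons}


sim_a = [[0, 1, 1, 1, 1] for _ in range(5)]
sim_b = [[0, 7, 2, 9, 4] for _ in range(5)]
print(constant_partition(sim_a, 3))
```

Since the output is identical for every similarity function on n points, a 2-partitioning of five points such as {0, 1}, {2, 3, 4} is never produced, which is exactly the failure of k-Richness used in the proof.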

Theorem 4 Consistency, MST-Consistency, and k-Richness are necessary to characterize
Single-Linkage.

The proof is omitted for space, but it is very similar to the proof of Theorem 3. A summary of
these results is available in Table 1. Now that we have seen our properties do not trivially
characterize either Single-Linkage or Max-Sum, we can move on to proving the uniqueness
theorems.

                         Consistency   k-Richness   MST-Consistency   MCT-Consistency
Single-Linkage                ✓             ✓               ✓                 ✗
Max-Sum                       ✓             ✓               ✗                 ✓
MST cuts family               ✗             ✓               ✓                 ✗
MCT cuts family               ✗             ✓               ✗                 ✓
Constant partitioning         ✓             ✗               ✓                 ✓

Table 1: Overview of discussed partitioning functions. Even if one were to consider more
partitioning functions, as a consequence of Theorems 6 and 7 the Single-Linkage and Max-Sum
rows are unique amongst all partitioning functions.

Lemma 5 Given a Consistent partitioning function F, and a similarity function s with
edges in descending order of similarity

$$s = (e_1, e_2, \ldots, e_p, e_q, \ldots, e_{\binom{n}{2}})$$




then for all k > 0, if $e_p$ and $e_q$ are both inner edges or both outer edges (w.r.t. $F(s, k)$), we
have

$$F((e_1, e_2, \ldots, e_q, e_p, \ldots, e_{\binom{n}{2}}), k) = F(s, k)$$

Proof In other words, whenever we have two edges of the same type (inner or outer), in
neighboring positions in the edge ordering of s, we can swap their positions while
maintaining the output of F. This is true because if both $e_p$ and $e_q$ are outer edges, then
we can shrink $e_p$ until $w(e_p) < w(e_q)$, all the while preserving the output of F (by
Consistency). Similarly, if both $e_p$ and $e_q$ are inner edges, we can expand $e_q$ until $w(e_p) < w(e_q)$. ■

Theorem 6 Max-Sum is the only MCT-Consistent Clustering function.

Proof Let F be any Consistent, k-Rich, MCT-Consistent clustering function, and let s
be any similarity function on n points, where k is an integer with 1 < k < n. We want to show
that for all s, k, $F(s, k) = \mathrm{MS}(s, k)$, where $\mathrm{MS}(s, k)$ is the result of Max-Sum on s. For
this purpose, we introduce the partitioning $\Gamma$ as whatever the output of MS is on s, so
$\mathrm{MS}(s, k) = \Gamma$. Whenever we say "inner" or "outer" edge for this proof, we mean with
respect to $\Gamma$.

By k-Richness of F, there exists an $s_1$ such that $F(s_1, k) = \mathrm{MS}(s, k) = \Gamma$. Now, through
a series of transformations that preserve the output of F, we transform $s_1$ into $s_2$, then $s_2$
into $s_3$, ..., until we arrive at s.

1. By k-Richness, we know there exists an $s_1$ such that $F(s_1, k) = \mathrm{MS}(s, k) = \Gamma$.

2. Using scale-invariance, we can linearly shrink all edges in $s_1$ until they are all smaller
than the smallest edge in s. Call the result $s_2$.

3. Now using consistency, we can expand all inner edges of $s_2$ until they are exactly of
the same weight as they appear in s. Call the result $s_3$. Thus, $s_3$ has all inner edges
set to exactly the same weight as in s, but the outer edges of $s_3$ are all smaller than
their counterparts in s.

4. Using consistency, we can shrink all outer edges of $s_3$ until the sum of all outer edges
of $s_3$ is smaller than the smallest inner edge of $s_3$ (and s). Call the result $s_4$.
5. By Lemma 5 we can reorder all outer edges in $s_4$ until their order among themselves
is the same order as they appear in s. Call the result $s_5$.

6. For this step it helps to review the construction of Gomory-Hu trees outlined in
Section 1.2.1. Consider that the sum of all outer edges in $s_5$ is less than any single
inner edge in $s_5$. Thus cutting all outer edges in $s_5$ is cheaper than cutting any
single inner edge. Furthermore, the removal of these outer edges in $G_{s_5}$ results in k
connected components. To construct a Gomory-Hu tree for $G_{s_5}$, we build the tree
by maintaining the Gomory-Hu invariant while querying points in differing clusters
of $\Gamma$ with each step. The standard s-t MinCut algorithm in the Gomory-Hu iteration
is guaranteed to return a subset of the outer edges of $s_5$, since the sum of all outer
edges has less weight than any individual inner edge. So after the first k - 1 iterations
of Gomory-Hu, the intermediate tree T will have k - 1 edges and k supernodes, and
the weight of the edges of T will be smaller than any inner edge of s. We then run
Gomory-Hu to completion to obtain $T_{s_5} = \mathrm{MCT}(s_5)$. We do the same for s, and
call the result $T_s$.

7. Since $T_s$ and $T_{s_5}$ are trees, their minimum k-cut is given by cutting their k - 1 smallest
edges respectively. However, since both of their k - 1 lightest edges correspond to
cutting the outer edges of $s_5$, $T_s$ and $T_{s_5}$ have the same minimum k-cut. Thus
by MCT-Consistency we can transform $s_5$ to s while maintaining the output of F.

8. Thus we have $F(s_5, k) = F(s, k) = \Gamma$.

We started with any s, and showed that $F(s, k) = \Gamma = \mathrm{MS}(s, k)$. We also know that
Max-Sum satisfies all 3 axioms and MCT-Consistency. Thus it is uniquely characterized. ■



Theorem 7 Single-Linkage is the only MST-Consistent Clustering function.
Proof

Let F be any Consistent, k-Rich, MST-Consistent clustering function, and let s be
any similarity function on n points, where k is an integer with 1 < k < n. We want to show
that for all s, k, $F(s, k) = \mathrm{SL}(s, k)$, where $\mathrm{SL}(s, k)$ is the result of Single-Linkage on s.
For this purpose, we introduce the partitioning $\Gamma$ as whatever the output of SL is on s,
so $\mathrm{SL}(s, k) = \Gamma$. Whenever we say "inner" or "outer" edge for this proof, we mean with
respect to $\Gamma$.

By k-Richness of F, there exists an $s_1$ such that $F(s_1, k) = \mathrm{SL}(s, k) = \Gamma$. Now, through
a series of transformations that preserve the output of F, we transform $s_1$ into $s_2$, then $s_2$
into $s_3$, ..., until we arrive at s.

1. By k-Richness, we know there exists an $s_1$ such that $F(s_1, k) = \mathrm{SL}(s, k) = \Gamma$.

2. Using scale-invariance, we can linearly shrink all edges in $s_1$ until they are all smaller
than the smallest edge in s. Call the result $s_2$.

3. Now using consistency, we can expand all inner edges of $s_2$ until they are exactly of
the same weight as they appear in s. Call the result $s_3$. Thus, $s_3$ has all inner edges
set to exactly the same weight as in s, but the outer edges of $s_3$ are all smaller than
their counterparts in s.

4. Using consistency, we can shrink all outer edges of $s_3$ until the sum of all outer edges
of $s_3$ is smaller than the smallest inner edge of $s_3$ (and s). Call the result $s_4$.

5. By Lemma 5 we can reorder all outer edges in $s_4$ until their order among themselves
is the same order as they appear in s; we can do the same for the inner edges. Call the
result $s_5$.

6. By Kruskal's algorithm for MST, s and $s_5$ have the same MST; call them $T_s$ and
$T_{s_5}$. Since $T_s$ and $T_{s_5}$ are trees, their minimum k-cut is given by cutting their k - 1
smallest edges respectively. However, since both of their k - 1 lightest edges correspond
to cutting the outer edges of $s_5$, $T_s$ and $T_{s_5}$ have the same minimum k-cut. Thus
by MST-Consistency we can transform $s_5$ to s while maintaining the output of F.



7. Thus we have $F(s_5, k) = F(s, k) = \Gamma$.

We started with any s, and showed that $F(s, k) = \Gamma = \mathrm{SL}(s, k)$. We also know that
Single-Linkage satisfies all 3 axioms (Zadeh and Ben-David, 2009) and MST-Consistency. Thus it
is uniquely characterized.



3. Conclusions & Future directions



In this paper we have characterized two clustering algorithms which are usually treated 
separately with differing motivating principles. Using our framework, one can aim to build a 
suite of abstract properties of clustering that will induce a taxonomy of clustering paradigms. 
Such a taxonomy should serve to help utilize prior domain knowledge to allow educated 
choice of a clustering method that is appropriate for a given clustering task. 

The chief contributions of this paper are the characterizations of clustering through as- 
sociated tree constructions. The tree construction relevant for Max-Sum turned out to be 
the Gomory-Hu tree, and the tree construction for Single-Linkage was the Maximum Span- 
ning Tree. It is expected that the general Gomory-Hu tree construction for any submodular 
function can also be used for characterizing other objectives, but it is not immediately clear 
how to change axioms such as Consistency to achieve this and we leave it for future work. 

Our contribution provides insight into the connection between Submodularity,
Single-Linkage, and Max-Sum clustering functions. For the latter two, we show they are
axiomatically identical except for one property. By considering the listing in Table 1 we demonstrate
the type of desired taxonomy of clustering functions based on the properties each satisfies.

Although at first glance it seems these properties are 'obviously' all that are necessary
to characterize Max-Sum and Single-Linkage, we show that this is not the case, by way
of Theorems 3 and 4. To investigate the ramifications of algorithms which satisfy only a
subset of our properties, we introduced the MCT Cuts family of partitioning functions, of
which Max-Sum is the only Consistent member. The uniqueness theorems came about as
a result of forcing functions to focus on Minimum/Maximum Cut/Spanning Trees.